library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.2.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
## ========================================
library(ComplexUpset)
library(ggrepel)
library(ggpubr)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(patchwork)
library(knitr)
library(grid)
library(png)
sample_table <- read_tsv('sampleTableFullV3.tsv') %>% filter(subtissue != 'synth')
## Parsed with column specification:
## cols(
## sample = col_character(),
## paired = col_character(),
## tissue = col_character(),
## subtissue = col_character(),
## origin = col_character(),
## body_location = col_character()
## )
load('clean_data/rdata/tissue_to_colors.Rdata')
load('clean_data/rdata/core_tight.Rdata')
tissue_color_mapping_df <- bind_rows(tissue_color_mapping_df, tibble(body_location=c('Brain(avg)', 'Body(avg)'), color=c('orange','yellow')))
load('clean_data/rdata/transcriptome_pipeline_stats.Rdata')
subtissue_to_bodyloc <- sample_table %>% select(subtissue, body_location) %>% distinct
core_tight <- core_tight %>% as_tibble %>% select(sample=sample_accession, study=study_accession) %>% distinct
load('clean_data/rdata/paper_numbers.Rdata')
load('clean_data/rdata/longread_summary.Rdata')
load('clean_data/rdata/buildResultsSummary.Rdata')
load('clean_data/rdata/novel_isoforms.Rdata')
load('clean_data/rdata/novel_loci_fig.Rdata')
load('clean_data/rdata/vep_results.Rdata')
color_list<- novel_transcripts_per_tissue$color
names(color_list) <- novel_transcripts_per_tissue$body_location
eye_tissues <- c("RPE_Fetal.Tissue", "Retina_Adult.Tissue", "RPE_Adult.Tissue", "Cornea_Adult.Tissue", "Cornea_Fetal.Tissue", "Retina_Fetal.Tissue")
subtissues <- unique(sample_table$subtissue)
body_tissues <- sample_table %>% filter(!subtissue%in% eye_tissues) %>% pull(subtissue) %>% unique
The detection of novel gene isoforms is a complicated task, but RNA-seq has been used as a powerful tool for investigating gene isoforms. Early methods using RNA-seq to detect gene isoforms generally focused solely on investigating the single RNA processing mechanism and determining what gene isoforms it was driving. For example, the computational tool rMATS detects novel gene isoforms by analyzing splicing patterns using RNA-seq(1). More recently, researchers have adapted methods from genome assembly to use RNA-seq to reconstruct the whole transcriptome of a biological samples, dubbed de novo transcriptome construction.(2) | De novo transcriptome construction uses short RNA-seq reads to reconstruct full-length mRNA transcripts. This is a particularly useful method because it is independent of any RNA processing mechanism. A major shortcoming of this method is that a large number of samples is required to combat the noisy nature of RNA-seq data, but because of the increasingly inexpensive sequencing, data sets of the necessary size are now available. The most comprehensive de novo transcriptome project to date has been CHESS, which used the GTEx dataset to construct de novo transcriptomes in over 9000 RNA-seq samples from 49 distinct location of the body to create a comprehensive annotation of mRNA transcripts across the human body. (3),(4),(5) However, as the GTEx dataset lacks any ocular tissues, the CHESS database is an incomplete annotation of the human transcriptome.
A major flaw of short read based de novo transcriptome construction is the inability to evaluate the accuracy of construction methods. Long read sequencing technologies provide as solution to this problems as long read sequencing captures entire full length transcripts and can be used to identify the full range of gene isoforms diveristy.<> While long reads have historicaly been considered inaccurate, the new PacBio sequel II system sequences long reads as accurately as short read based seqencing.<> We propose that long read based transcriptomes can serve as a ground truth for evaluating short-read base transcriptomes, and in this study we use PacBio long read RNA sequencing to inform the construction of short read transcriptomes. We first generate pacbio long read RNA seq data from a stem cell derived retinal pigmented epithelium(RPE) cell line along with illumina short read RNA-seq, and design a rigorous de novo transcriptome pipeline that maximizes the conordance between short and long read transcriptomes.
We apply this pipeline to construct a pan eye transcriptome using a previously published data set containing 368 ocular tissue samples compiled from mining publicly available sequencing data. (6). We use this pipeline build transcirptomes in three major ocular subtissues: The cornea, retina, and the retinal pigmented epithelium (RPE), using RNA-seq data from both adult and fetal tissues to create a high-quality pan-eye transcriptome. In addition to our ocular samples, we used a subset of the GTEx dataset to construct transcriptomes for 49 other locations across the body to facilitate comparisons in transcriptomes across the body. We use our pan eye de novo transcriptome to reveal hundreds of novel gene isoforms as well as several novel genes in the and analyze their potential impact on ocular biology and disease. We provide our de novo transcriptomes as a resource to other researchers through an R package and R-shiny webapp.
fig
Figure 0. DNTX workflow
PacBio hifi reads were processed into full length, non chimeric(FLNC) read using the Pacbio SMRT link software. We next adapted the Encode Long Read analysis pipeline : Transcripts were aligned to the human genome using the minimap2(7) with the following parameters. Aligned long reads were processed with TranscriptClean(8) to remove errors due to sequencing. The long read transcriptome was genreated with TALON (9)
We aligned each sample to the Gencode V28 hg38 assembly using the genomic aligner STAR and sorted the resulting BAM files using samtools sort.(10),(11),(12) For each sorted BAM file, we constructed a per sample base transcriptome using stringtie with the Gencode v28 comprehensive annotation as a guiding annotation and default parameters.(10),(3). All sample transcriptomes were merged with the long read transcriptome using gffcompare(13). We define the metric construction accuracy sued to evaluate short read transcriptome construction as the following: \(Construction\ Accuracy = \frac {short\ read \ transcriptome\ \cap \ long\ read\ transcriptome} {short\ read \ transcriptome}\)
Analysis of novel exons was done using a custom Rscript. To predict the clinical significance of clinvar variants of uncertain significance, we selected variants from the clinvar database labeled as “Uncertain Significance”, and used Ensembl’s variant effect preditor
Protein coding novel loci were compared to the uniprot protein sequence data base using blastp(15). Blastp results were integrated with hmmer, which identified protein families and domains associated with each novel loci
https://github.com/vinay-swamy/pacbio_testing - long read analysis pipeline, https://github.com/vinay-swamy/ocular_transcriptomes_paper - figures and table for this paper.
In order to determine the accuracy of short read transcriptome construction, we first generated PacBio long read RNA-seq data and illumina short read RNA-seq data from a stem cell derived RPE cell line. These cell lines are cultured using a highly optimized protocol, and thus shuold have minimal biological variation. We used this sequencing data to contruct a long read transcriptome and a short read transcriptome. In our long read transcriptome we find XXX distinct transcripts, and in our short read transcriptome YYY distinct transcripts
sum_counts <- length_df_plotting %>%
filter(intersection_case %in% c('stringtie','pacbio','stringtie-pacbio')) %>%
mutate(intersection_case = gsub('stringtie', 'StringTie', intersection_case),
intersection_case = gsub('pacbio', 'PacBio', intersection_case),
intersection_case = gsub('StringTie-PacBio', 'PacBio &\nStringTie', intersection_case)) %>% group_by(intersection_case) %>% count()
violin_plot <- length_df_plotting %>%
filter(intersection_case %in% c('stringtie','pacbio','stringtie-pacbio')) %>%
mutate(intersection_case = gsub('stringtie', 'StringTie', intersection_case),
intersection_case = gsub('pacbio', 'PacBio', intersection_case),
intersection_case = gsub('StringTie-PacBio', 'PacBio &\nStringTie', intersection_case)) %>%
ggplot() +
scale_y_continuous(breaks = c(2000,4000,6000, 8000, 10000)) +
geom_text(data = sum_counts, aes(x=intersection_case, y = 11000, label=n), stat= 'identity') +
geom_violin(aes(x=intersection_case,y=length)) + geom_hline(yintercept = 2000, color='red') + cowplot::theme_cowplot() + ylab('Transcript\nLength (bp)') + xlab('') + coord_flip(ylim = c(0,11500))
tpm_exp_plot <- ggplot(data = tpm_df_plotting %>% ungroup, aes(x=co, y=acc, color=`Length Interval`)) +
geom_point() +
geom_line()+
#xlim(c(-3,26))+
#ylim(c(0,1.3))+
#geom_label_repel(data = labdf_plotting%>% ungroup, aes(x=co, y=y, label = total_tx), fill='white', color='black', nudge_x = -1.5, nudge_y = .25) +
labs(label = 'asf')+
ylab('Build Accuracy\n(StringTie ∩ PacBio / StringTie)') +
xlab('TPM threshold') +
facet_wrap(~filtering_type) +
cowplot::theme_cowplot() +
scale_color_viridis_d()
violin_plot/tpm_exp_plot +plot_annotation(tag_levels = 'A') + plot_layout(heights = c(1,1.7))
Figure 1. Comparison of long read and short read fetal RPE transcriptomes. A) Intersection of transcript lengths between Pacbio(long read) and Stringtie(short read) transcriptomes. B) Short read contruction accuracy stratified by transcript length at different TPM based transcript exclusion thresholds.
df <- sample_table %>%
left_join(core_tight) %>%
filter(!body_location %in% c('Body', 'Brain', 'ESC_Stem.Cell.Line', 'Lens_Stem.Cell.Line')) %>%
group_by(subtissue) %>%
summarise(`Samples` = n(), `Studies` = length(unique(study))) %>%
arrange(desc(`Studies`)) %>%
inner_join(tx_counts %>% select(subtissue, `Transcriptome Count`=filtered)) %>%
mutate(subtissue=gsub('_|\\.',' ', subtissue)) %>%
mutate(subtissue = gsub(' Tissue', '', subtissue)) %>%
separate(subtissue, into = c('Tissue' , 'Source'), sep = ' ')
## Joining, by = "sample"
## Joining, by = "subtissue"
kable(df)
| Tissue | Source | Samples | Studies | Transcriptome Count |
|---|---|---|---|---|
| Retina | Adult | 105 | 8 | 38961 |
| RPE | Fetal | 49 | 7 | 41641 |
| Cornea | Adult | 43 | 6 | 43897 |
| Retina | Fetal | 89 | 6 | 50584 |
| RPE | Adult | 48 | 4 | 27877 |
| Cornea | Fetal | 6 | 2 | 57712 |
Table 1. Ocular sample dataset overview and transcriptome size. Transcriptome size is defined as the number of unique transcripts expressed in a given tissue type
Novel transcripts are split into two categories: novel isoforms, which are novel variations of known genes, and novel loci, which are previously unreported, entirely novel regions of transcribed sequence. Novel isoforms are further classified by the novelty of its encoded protein, ie novel ORF, a novel isoform with a known ORF, and isoforms with no orf as noncoding isoforms. The number of distinct ORFs is significanlty less than the number of transcripts, with 41201 previously annotated ORFs and 26593novel ORFs across all tissues. Across all tissues there is an average of 5585.9 novel isoforms and 1510.81 novel ORFs.
novel_isoforms_per_tissue <- inner_join(novel_transcripts_per_tissue, novel_orfs_per_tissue) %>%
select(-novel_pc_count) %>%
gather(key = 'transcript_type', value = 'counts', -body_location, -color) %>%
filter(!body_location %in% c('Lens_Stem.Cell.Line', 'ESC_Stem.Cell.Line')) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location_pretty=gsub('Tissue','', body_location_pretty),
body_location_pretty=gsub('avg','mean', body_location_pretty),
body_location_pretty=gsub('\\(',' (', body_location_pretty),
body_location_pretty=gsub('Adult','(Adult)', body_location_pretty),
body_location_pretty=gsub('Fetal','(Fetal)', body_location_pretty)) %>%
mutate(transcript_type = gsub('_',' ', transcript_type),
transcript_type = gsub(' count', '', transcript_type),
transcript_type = gsub('nc','NC', transcript_type),
transcript_type = gsub('orf','ORF', transcript_type),
transcript_type = gsub('novel','Novel', transcript_type),
transcript_type = gsub('ref','Ref.', transcript_type))
## Joining, by = c("body_location", "color")
novel_loci_per_tissue <- novel_loci_per_tissue %>%
gather(key = 'transcript_type', value = 'counts', -body_location, -color) %>%
filter(!body_location %in% c('Lens_Stem.Cell.Line', 'ESC_Stem.Cell.Line')) %>%
mutate(transcript_type = gsub('_\\w+','', transcript_type),
transcript_type = gsub('noncoding' ,'Noncoding', transcript_type),
transcript_type = gsub('pc','Protein\nCoding', transcript_type)) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location_pretty=gsub('avg','mean', body_location_pretty),
body_location_pretty=gsub('Tissue','', body_location_pretty),
body_location_pretty=gsub('\\(',' (', body_location_pretty),
body_location_pretty=gsub('Adult','(Adult)', body_location_pretty),
body_location_pretty=gsub('Fetal','(Fetal)', body_location_pretty))
color_list<- novel_isoforms_per_tissue$color
names(color_list) <- novel_isoforms_per_tissue$body_location_pretty
loci <- ggplot(data = novel_loci_per_tissue) +
geom_col(aes(x=body_location_pretty, y=counts, fill = body_location_pretty, alpha = transcript_type), position = 'dodge') +
scale_fill_manual(values = color_list, 'Body\nLocation') +
scale_alpha_discrete(range=c(.5,1), name = 'Transcript\nType') +
#ggtitle('Novel Isoforms Contructed Across the Body')+
ylab('Novel\nTranscript\nCount')+
xlab('Body Location')+
cowplot::theme_cowplot() +
theme(axis.text.x=element_blank(), panel.background = element_blank())
## Warning: Using alpha for a discrete variable is not advised.
isotype_per_tissue <- ggplot(data = novel_isoforms_per_tissue) +
geom_col(aes(x=transcript_type, y=counts, fill = body_location_pretty), position = 'dodge') +
scale_fill_manual(values = color_list) +
scale_alpha_discrete(range=c(.5,1)) +
#ggtitle('Novel Isoforms Contructed Across the Body')+
ylab('Noncoding Transcript Count')+
xlab('')+
cowplot::theme_cowplot() + theme(legend.position = "none") +
theme(axis.text.x=element_text(angle=45, hjust = 1), panel.background = element_blank())+
facet_wrap(~body_location_pretty, ncol=2)
## Warning: Using alpha for a discrete variable is not advised.
novel_transcript_types <- novel_isoform_anno_df %>% filter(label != 'Protein Coding') %>%
mutate(group = 'Transcript Type') %>%
rename(`Transcript Type` = label) %>%
ggplot(data = ., aes(x = group, fill = `Transcript Type`, y=count)) +
geom_bar(position = 'fill',stat = 'identity') +
geom_text(aes(label = count), position = position_fill(vjust=.5)) +
ylab('% of all novel transcripts') +
cowplot::theme_cowplot() +
theme(panel.background = element_blank(), axis.title.x = element_blank())
novel_exon_types <- exon_type_by_transcript_type %>% group_by(nv_type_rc) %>%
summarise(count = n()) %>%
mutate(group = 'Novel Exon Type') %>%
rename(`Novel Exon Type` = nv_type_rc) %>%
ggplot(data = ., aes(x = group, fill = `Novel Exon Type`, y=count)) +
geom_bar(position = 'fill',stat = 'identity') +
geom_text(aes(label = count), position = position_fill(vjust=.5)) +
ylab('% of all novel transcripts') +
cowplot::theme_cowplot() +
theme(panel.background = element_blank(), axis.title.x = element_blank())
exon_type_tx_plot <- exon_type_by_transcript_type %>%
mutate(comp_transcript_type = gsub('Isoform', 'Isoform,\n', comp_transcript_type)) %>%
mutate(nv_type_rc = gsub('_',' ', nv_type_rc),
nv_type_rc = gsub('novel', 'Novel', nv_type_rc)) %>%
group_by(comp_transcript_type, nv_type_rc) %>%
summarise(count = n()) %>%
ggplot(data=., aes(x = comp_transcript_type,y=count, fill = nv_type_rc)) +
geom_bar(position = 'fill', stat = 'identity')+
geom_text(aes(label = count), position = position_fill(vjust=.5)) +
ylab('Proportion of Transcripts') +
cowplot::theme_cowplot() +
theme(axis.text.x=element_text(angle=45, hjust = 1),
panel.background = element_blank(), axis.title.x = element_blank()) +
scale_fill_manual(values = pals::cols25() %>% unname(), name = 'Transcript\nCategory')
layout <- '
AB
AC
'
isotype_per_tissue + loci + exon_type_tx_plot +plot_annotation(tag_levels = 'A') +plot_layout(design = layout, heights = c(1,5), widths = c(1.5,1), guides = 'collect')
Figure 2. Overview of Novel Isoforms. A. Number of novel gene isoforms, grouped by transcript type. Brain and body represent an average of 13 and 34 distinct subtissues, respectively B. Novel protein coding and noncoding loci. Novel exon composition of novel isoforms, by isoform type labels indicate number of transcripts.
We found that the majority of novel exons with our dataset are novel first and last exons. We see that the majority of A5SS exons lead to novel protein coding isoforms, where as novel TES/TSS more often lead to noncoding isoforms. Retained introns mostly commonly lead to a novel isoform.
map_rate_diffs <- all_sample_mapping_rate_difference %>%
filter(!body_location %in% c('Lens_Stem.Cell.Line', 'ESC_Stem.Cell.Line')) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location=case_when(body_location == 'Body' ~ 'Body(mean)',
body_location == 'Brain'~ 'Brain(mean)' ,
TRUE ~ body_location),
body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location_pretty=gsub('avg','mean', body_location_pretty),
body_location_pretty=gsub('Tissue','', body_location_pretty),
body_location_pretty=gsub('\\(',' (', body_location_pretty),
body_location_pretty=gsub('Adult','(Adult)', body_location_pretty),
body_location_pretty=gsub('Fetal','(Fetal)', body_location_pretty),
mapping_rate_diff = mapping_rate_diff * -1 ,
mapping_rate_diff = replace(mapping_rate_diff, mapping_rate_diff <=0, 0),
pt = 'mapping rate reduction')
size_reduction_df <- inner_join(gencode_tx_size, tx_counts) %>%
mutate(percent_txome_size_decrease = ( all_exp-filtered) / all_exp) %>%
inner_join(subtissue_to_bodyloc) %>%
filter(!body_location %in% c('Lens_Stem.Cell.Line', 'ESC_Stem.Cell.Line')) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location=case_when(body_location == 'Body' ~ 'Body(mean)',
body_location == 'Brain'~ 'Brain(mean)' ,
TRUE ~ body_location)) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location_pretty=gsub('avg','mean', body_location_pretty),
body_location_pretty=gsub('Tissue','', body_location_pretty),
body_location_pretty=gsub('\\(',' (', body_location_pretty),
body_location_pretty=gsub('Adult','(Adult)', body_location_pretty),
body_location_pretty=gsub('Fetal','(Fetal)', body_location_pretty)) %>%
group_by(body_location_pretty) %>%
summarise(percent_txome_size_decrease = mean(percent_txome_size_decrease)) %>%
mutate(pt = '% Base txome removed')
## Joining, by = "subtissue"
## Joining, by = "subtissue"
cl<- novel_transcripts_per_tissue$color
names(cl) <- novel_transcripts_per_tissue$body_location
ggplot(map_rate_diffs) +
geom_boxplot(aes(x = pt, y = mapping_rate_diff, fill = body_location_pretty))+
geom_col( data = size_reduction_df, aes(x = pt, y=percent_txome_size_decrease, fill = body_location_pretty)) +
scale_fill_manual(values = color_list, ) +
theme(axis.text.x=element_text(angle=45, hjust = 1), panel.background = element_blank()) +
facet_wrap(~body_location_pretty, ncol = 2) +
ylab('fraction decrease') +
theme(legend.position = "none")
Figure 3. Comparison of Salmon mapping rate decrease vs transcriptome size decrease. Any gain in mapping rate has been set to 0 for simiplicity
primary_isoforms_per_tissue <- primary_isoforms_per_tissue %>%
mutate(primary_isoform_origin = gsub('dntx','DNTx', primary_isoform_origin),
primary_isoform_origin = gsub('gencode', 'GENCODE', primary_isoform_origin),
primary_isoform_origin = gsub('appris', 'APPRIS', primary_isoform_origin)) %>%
mutate(primary_isoform_origin = factor(primary_isoform_origin, levels = c('DNTx', 'GENCODE', 'APPRIS'))) %>%
inner_join(subtissue_to_bodyloc) %>%
filter(!body_location %in% c('Lens_Stem.Cell.Line', 'ESC_Stem.Cell.Line')) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location_pretty=case_when(body_location == 'Body' ~ 'Body(mean)',
body_location == 'Brain'~ 'Brain(mean)' ,
TRUE ~ body_location)) %>%
mutate(body_location_pretty=gsub('_|\\.', ' ', body_location),
body_location_pretty=gsub('avg','mean', body_location_pretty),
body_location_pretty=gsub('Tissue','', body_location_pretty),
body_location_pretty=gsub('\\(',' (', body_location_pretty),
body_location_pretty=gsub('Adult','(Adult)', body_location_pretty),
body_location_pretty=gsub('Fetal','(Fetal)', body_location_pretty)) %>%
group_by(body_location_pretty, primary_isoform_origin) %>%
summarise(count = mean(count)) %>%
arrange()
## Joining, by = "subtissue"
ggplot(data = primary_isoforms_per_tissue, aes(x = body_location_pretty, y=count, fill = primary_isoform_origin))+
geom_bar(position = 'fill', stat = 'identity')+
geom_text(aes(label = as.integer(count) ), position = position_fill(vjust=.5)) +
cowplot::theme_cowplot() +
theme(axis.text.x=element_text(angle=45, hjust = 1), panel.background = element_blank()) +
scale_fill_brewer(palette = 'Set1', name = 'Origin') + ylab('Ratio') + xlab('')
Figure 4. APPPRIS principal isoforms within the de novo transcriptome
source('scripts/R_raincloud.R')
plot_list <- novel_eye_tx_by_tissue[!names(novel_eye_tx_by_tissue) %in% c('Lens_Stem.Cell.Line', 'ESC_Stem.Cell.Line') ]
all_tx <- tibble(transcript_id = reduce(plot_list, union))
plot_df <- bind_cols(all_tx, lapply(plot_list, function(x) all_tx$transcript_id%in% x))
FIG_FONT_SIZE=12
colnames(plot_df) <- gsub('_|\\.', ' ', colnames(plot_df) )
colnames(plot_df) <- gsub('Adult', '(Adult)', colnames(plot_df) )
colnames(plot_df) <- gsub('Fetal', '(Fetal)', colnames(plot_df) )
colnames(plot_df) <- gsub('Tissue', '', colnames(plot_df) )
us <- ComplexUpset::upset(data = plot_df,
intersect = colnames(plot_df)[-1],
min_size= 500,
set_sizes = F,
#height_ratio = .8,
#width_ratio = .1,
wrap = T,
stripes = c('white','white'),
base_annotations = list(
'Intersection size' = intersection_size(text_colors = c(on_background='red', on_bar='red'))
),
themes=upset_modify_themes(
list(
"intersections_matrix"= theme(text = element_text(size=FIG_FONT_SIZE) ),
"Intersection size"= theme(text = element_text(size=FIG_FONT_SIZE),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()),
"overall_sizes"= theme(axis.text.x=element_text(angle=45),text = element_text(size=FIG_FONT_SIZE) ),
"default"= theme(text = element_text(size=FIG_FONT_SIZE))
)
)
)
plotting_piu_df <- piu_df %>% mutate(new_stage=ifelse(stage == 'adult', 'Adult', 'Fetal'),
subtissue=paste0(tissue, '_', new_stage, '.Tissue')) %>%
mutate(subtissue=gsub('_|\\.', ' ', subtissue),
subtissue=gsub('avg','mean', subtissue),
subtissue=gsub('Tissue','', subtissue),
subtissue=gsub('\\(',' (', subtissue),
subtissue=gsub('Adult','(Adult)', subtissue),
subtissue=gsub('Fetal','(Fetal)', subtissue))
piu <- ggplot(data=plotting_piu_df, aes(x=stage, y=piu, fill=subtissue, color=subtissue)) +
geom_flat_violin(position = position_nudge(x = .2, y = 0),adjust = 2)+
geom_point(position = position_jitter(width = .15), size = .25, alpha=.2) +
geom_boxplot(outlier.shape = NA, alpha = 1, width = .1, colour = "BLACK")+
scale_fill_manual(values = color_list) +
scale_color_manual(values = color_list) +
ylab('piu')+xlab('tissue')+coord_flip()+
cowplot::theme_cowplot()+ facet_wrap(~tissue, nrow = 3) +
theme(legend.position = "none")
us + piu + plot_layout(widths = c(2,1)) + plot_annotation(tag_levels = 'A') & theme(plot.tag = element_text(size= FIG_FONT_SIZE), text = element_text(size = FIG_FONT_SIZE))
Figure 5. Analysis of novel isoform characteristics A. Set intersection of novel isoforms in ocular transcriptomes. B Boxplots of PIU overlaid over PIU data points with estimated distribution of data set above each boxplot
We determined the longest ORF’s for each novel loci, and for each transcript with a valid ORF, predicted the function the encoded protein using hmmer searching within the pfam databse, and blastp searching within the swissprot database. We chose to hmmer results that had a sequence match e-value < 1e-4, and blastp results with e-value < 1e-15.(Sup Table) We found XXX distinct potentially novel coding loci from blast and hmmer combined. We highlight one detected in fetal RPE
library(patchwork)
cov <- ggplot(cov_df, aes(x=coord,y=cov, group = sample))+
geom_line( alpha = .3)+
geom_area(alpha = .3)+
#ggtitle(paste(txid, s_st))+
theme(axis.title.x = element_blank(), panel.background = element_blank(),
axis.text.x = element_blank(), axis.ticks.x = element_blank())
pt <- ifelse(gene_model_df$strand[1] == '+',62, 60)
model <- ggplot(gene_model_df)+
geom_rect(aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax))+
geom_rect(xmin=min(gene_model_df$xmin), xmax=max(gene_model_df$xmax), ymin=-.1, ymax=.1)+
geom_point(data=point_df, aes(x = x, y=y), shape= pt, size = 5)+
xlab('genome') +
ylab(element_blank())+
theme(panel.background = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())
layout <- '
AA
AA
AA
AA
AA
AA
BB
CC
CC
'
mss <- as_ggplot(rasterGrob(readPNG('clean_data/plots/msa_ex.png'), interpolate = T))
cov/model/mss + plot_layout(design = layout) +plot_annotation(tag_levels = 'A')
Figure 6. DNTX000dddd, potentially protein coding novel isoform in Fetal RPE. A) RNA-seq coverage of exons across 49 samples. B transcript model of DNTX000ddd. thick regions represtn coding region. C. Multiple sequence alignment of DNTX000ddd and closely related proteins.
Placeholder figure, still trying to find the right transcript to look at here.
The prediction of variants’s biological impact is a fundamental step in variant prioritzation, especially in the diagnosis of genetic diseases or in interpreting GWAS results. We use our Ensembl’s Variant Effect Predictor combined with our tissue transcriptomes to generate a tissue specific prediction of variant impact for variants of unknown significance(VUS) the clinvar database. We grouped different predicted consequences into a High, moderate and low impacts, and provide a full mapping of biological consequence to impact in our supplementary data.
n=10
n=10
vep_exp_all_eye <- filter(dntx_vep_impact_matrix, allele_id %in% clinvar_vus_eye$allele_id) %>%
filter(rowSums(.[,eye_tissues] == 3) >=1, rowSums(.[,body_tissues] !=3) ==length(body_tissues))
vep_no_eye <- filter(dntx_vep_impact_matrix, allele_id %in% clinvar_vus_eye$allele_id) %>%
filter(rowSums(.[,eye_tissues] != 3) == length(eye_tissues),three_count >4 )
impact_diff_changes <- impact_diff %>% filter(min_diff!=0 | max_diff!=0)
eye_impact_increase <- impact_diff_changes %>% filter(max_diff>0, allele_id%in% clinvar_vus_eye$allele_id)
vep_impact_increase <- filter(dntx_vep_impact_matrix, allele_id %in% eye_impact_increase$allele_id)
vep_example <- bind_rows( vep_exp_all_eye %>% arrange(desc(var))%>%
#.[1:n,] %>%
select(allele_id, eye_tissues, everything()),
vep_no_eye %>% arrange(desc(var)) %>%
#.[1:n,] %>%
select(allele_id, eye_tissues, everything()) ,
vep_impact_increase
)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(eye_tissues)` instead of `eye_tissues` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
clinvar_example <- filter(clinvar_vus_eye, allele_id %in% vep_example$allele_id) %>% select(allele_id, contains('pheno'))
#vep_example <- filter(vep_impact_matrix, allele_id %in% clinvar_vus_eye$allele_id) %>% arrange(desc(var))
gencode_vep_by_tissue_simple <- gencode_vep_by_tissue %>%
mutate(subtissue = 'gencode', max_impact_avg_exp=1, max_impact_med_exp=1) %>%
distinct()
merge_bt <- bind_rows(dntx_vep_by_tissue %>% mutate(build= 'dntx'),
gencode_vep_by_tissue_simple%>% mutate(build= 'gencode'))
plot_df <- merge_bt %>%
filter(allele_id %in% vep_example$allele_id) %>%
mutate(avg_exp = log2(max_impact_avg_exp+1),
med_exp = log2(max_impact_med_exp + 1),
avg_exp = replace(avg_exp , avg_exp >5, 5),
allele_id = as.character(allele_id),
`Variant Impact` = case_when(max_impact == 1 ~'Low',
max_impact == 2 ~ 'Moderate',
max_impact == 3 ~ 'High') %>% factor(levels=c('Low', 'Moderate', 'High')),
subtissue = factor(subtissue, levels = c('gencode',eye_tissues, body_tissues))
) %>%
rename(`log2(TPM + 1)` = avg_exp)
plot_df[is.na(plot_df)] <- 0
cl <- c('Low' = 'lightblue', 'Moderate' = 'lightgreen', 'High' = 'red')
shp <- c('dntx'=16, 'gencode'=15)
ggplot(plot_df, aes(x =allele_id, y=subtissue))+
geom_point(aes(size = `log2(TPM + 1)`, color = `Variant Impact`, shape=build))+
scale_color_manual(values =cl )+
scale_shape_manual(values = shp)+
geom_hline(yintercept = 1.5)+
ggtitle('Tissue Specific Variant Prioritization') +
theme(axis.text.x.bottom = element_text(angle = 45, hjust = 1), panel.background = element_blank())
Figure 7. Tissue specific variant prioritization of variants associated with ocular disease. Size corresponds to average TPM and color to impact. A full mapping of variant descriptors to impact can be found in supplemental data
We compared out tissue specific variant effect predictions to variant effect predictions using the gencode annotation. We found that out 226824 total variants examined,the highest priority of 117556 decreased in at least one tissue, and 1075 increased in at least one tissue. We see a larger decrease in predicted effects as the transcript which previously lead to an effect is not expressed in a respective tissue. We next selected variants that were associated with ocular disease phenotypes, to see if there were any interesting changes. Variant 280266, associated retinitis pigmentosa, while considered high impact in gencode as well, its only high impact in adult retina, so more likely to be causative. Variant 843125 associated with retinal pigmentary distrophy, as a high gencode impact, and is high impact in other tissues, low impact and not expressed in any ocular tissue, so likely not disease causing.
piu <- as_ggplot(rasterGrob(readPNG('clean_data/plots/piu_shiny_app.png'), interpolate = T))
frac_det <- as_ggplot(rasterGrob(readPNG('clean_data/plots/frac_sample_det.png'), interpolate = T))
gene_body <- as_ggplot(rasterGrob(readPNG('clean_data/plots/gene_bodies_shiny_app.png'), interpolate = T))
layout <- '
AABB
CCCC
'
piu + frac_det + gene_body + plot_layout(design = layout)+ plot_annotation(tag_levels = 'A')
Figure 8. Screenshots from dynamic de novo transcriptome visualization tool. A). PIU bar plot for selected gene and tissue. B). Exon level diagram of transcript body Thicklines represent coding region of transcript. novel exons colored in red. Tooltip contains genomic location and phylop score C) Bargraph of fraction of samples within dataset each transcript was consructed in by tissue.
Finally, we package all tools used for out transcriptome pipeline within a portable docker container with a stand-alone run script. This pipeline allows other researchers to run their own samples, and generate figures and annotations similar to what is shown here.
Motivated by the lack of a comprehensive pan eye transcriptome we created the first comprehensive set of ocular transcriptomes. By using long read RNA-sequencing data to calibrate our short-read construction pipeline, we are confident we created real, biologically relevant transcriptomes. We found that concordance between long and short read based transcriptome is directly related to transcript expression and transcript length. We saw a clear inability within this pacbio data to accurately detect transcripts shorter than 2000Bp. As many of the transcripts constructed using short reads are below this threshold, long read sequencing data enriched for smaller transcript sizes would provide greater insight in future studies. Even when accoutning for transript length, we acheived a contruction accuracy of 0.389. This was surprisingly low. It’s difficult to say whether this low concordance it due to a high degree of transcriptional noise, or errors in the short read transcriptome construction. We see many more transcripts in the long read data than the short read data which may hint to transcriptional noise; obtaining long read data from more samples and seeing the concordance across multiple samples would more accurately answer this question.
We used a large dataset compiled from published RNA-seq data to build our ocular transcriptomes, an approach which has several key advantages. First, our large sample size allows us to combat the noisy nature of RNA-seq data. More importantly, because we only keep transcripts detected across multiple samples from multiple studies with multiple types of sample preparation, we can be more confident that our transcriptomes accurately reflect the biology of its originating subtissue and are not a technical artifact due to preparation of the samples or the transcriptome construction algorithmn.
We show that across all tissue types, the number of constructed transcripts is dramatically less than the number of transcripts present in the Gencode reference annotation.(Table 1; Figure 1a) Despite the large reduction in number of transcripts in the annotation, we see little loss in gene expression data. Given this and that we recover XXX% of gencode annotation across all tissues,?? indicates that many of the transcripts in the gencode annotation are not biologically relevant, and that the one size fits all strategy of having a single comprhensive annotation may not be the by the right one. This reduction in total transcriptome sizes is also useful for other researchers as it creates a smaller search space to explore, ie its easier to compare 30000 tx vs 300000
In almost all tissues we see more noncoding novel isoforms than coding. However as all/most of our data was prepared with polyA selection, we likely are not capturing cannonical lincRNA/ncRNA which do not have polyA tails. These transcripts fall under the processed_transcript catagory by the gencode annotation which is a very vague catagory so its difficuly to predict their function. Many of these have alternaive FEs so might be unders a different regulatory program. Though we identified many novel ORF’s its difficult to say whether these are actually protein coding. We see that many of the novel ORF’s are novel because of retained introns. These retained intron transcirpt may simply be unspliced mRNA. Studies have shown that some neuronal cell types stockpile unspliced transcripts for generating rapid changes in gene expression, and so we may be capturing a simiar phenomenon here. However we are confident that we are capturring true novel protein isoforms. We find a lot, but no where near the upper limit sugessted by (20)
We observed that in the set of novel exons within our transcriptome, the vast majority are novel first and last exons. It is difficult to directly tell what the biological relevance of these might be. There are multiple studies examining the roles of alternative first and last exons, with first exons arising because of alternative promoter usage, and last exons by alternative polyadenylation. (21),(22) These studies have highlighted the distinct biological role these phenomena play. However, other studies have shown that many of these differences in start sites are largely nonfunctional and lack biological significance.(23). Previous research devoted to studying splicing vastly outnumbers other research on other RNA processing types, yet we show that isoforms generated by these mechanisms are equally abundant, and thus merit more study. Something about the reliability of determining novel 5’ ends and 3’ ends from RNA-seq/de novo transcriptomes.
In our ocular transcriptomes, we see that novel isoforms are largely subtissue-specific. This matches previously reported findings about the tissue specificity of rare exons and first and last exons. We also find that on average novel isoforms represent about 15% of their parent. However, it is difficult to say if these are non-functional. Its is entirely possible that these isoforms are from low population cell types. This especially makes sense in the retina, bipolar/amacrine cell types comprise ~15% of the total cell population, and that within these cell types there are even smaller sub cell type populations. However, we note that transcripts only expressed in rare cell types are likely to be missed by our pipeline, as we used for a conservative expression threshold for the filtering step within our pipeline
We want to make our transcriptomes easily accessible to other researchers, so we designed a webapp to visualize our transcriptomes and access tissue-specific annotation files. We wanted to provide as much information to the user so that they can draw their own conclusions about the significance of potential novel exon, and so we provide the gene model with novel exon, coding and noncoding regions marked, along with the PIU for each transcript constructed within a gene. We also provide the fraction of samples within a given subtissue type a sample was detected in, to provide a further level of evidence to the validity of constructed transcripts.
utility of de novo transcriptomes - primary isoforms, variant effect prediction, somthing about precision medicine.
novel loci found some, some are interseting, some are not.
1. Shen,S., Park,J.W., Lu,Z.-x., Lin,L., Henry,M.D., Wu,Y.N., Zhou,Q. and Xing,Y. (2014) rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proceedings of the National Academy of Sciences, 111, E5593–E5601.
2. Haas,B.J., Papanicolaou,A., Yassour,M., Grabherr,M., Blood,P.D., Bowden,J., Couger,M.B., Eccles,D., Li,B. and Lieber,M. et al. (2013) De novo transcript sequence reconstruction from RNA-Seq: Reference generation and analysis with Trinity. Nature protocols, 8.
3. Pertea,M., Pertea,G.M., Antonescu,C.M., Chang,T.-C., Mendell,J.T. and Salzberg,S.L. (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology, 33, 290–295.
4. GTEx Consortium, Laboratory, Data Analysis &Coordinating Center (LDACC)—Analysis Working Group, Statistical Methods groups—Analysis Working Group, Enhancing GTEx (eGTEx) groups, NIH Common Fund, NIH/NCI, NIH/NHGRI, NIH/NIMH, NIH/NIDA and Biospecimen Collection Source Site—NDRI et al. (2017) Genetic effects on gene expression across human tissues. Nature, 550, 204–213.
5. Pertea,M., Shumate,A., Pertea,G., Varabyou,A., Breitwieser,F.P., Chang,Y.-C., Madugundu,A.K., Pandey,A. and Salzberg,S.L. (2018) CHESS: A new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biology, 19, 208.
6. Swamy,V. and McGaughey,D. (2019) Eye in a Disk: eyeIntegration Human Pan-Eye and Body Transcriptome Database Version 1.0. Investigative Ophthalmology & Visual Science, 60, 3236–3246.
7. Li,H. (2018) Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics (Oxford, England), 34, 3094–3100.
8. Wyman,D. and Mortazavi,A. (2019) TranscriptClean: Variant-aware correction of indels, mismatches and splice junctions in long-read transcripts. Bioinformatics, 35, 340–342.
9. Wyman,D., Balderrama-Gutierrez,G., Reese,F., Jiang,S., Rahmanian,S., Forner,S., Matheos,D., Zeng,W., Williams,B. and Trout,D. et al. (2020) A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv, 10.1101/672931.
10. Frankish,A., Diekhans,M., Ferreira,A.-M., Johnson,R., Jungreis,I., Loveland,J., Mudge,J.M., Sisu,C., Wright,J. and Armstrong,J. et al. (2019) GENCODE reference annotation for the human and mouse genomes. Nucleic acids research, 47, D766–D773.
11. Dobin,A., Davis,C.A., Schlesinger,F., Drenkow,J., Zaleski,C., Jha,S., Batut,P., Chaisson,M. and Gingeras,T.R. (2013) STAR: Ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England), 29, 15–21.
12. Li,H., Handsaker,B., Wysoker,A., Fennell,T., Ruan,J., Homer,N., Marth,G., Abecasis,G., Durbin,R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25, 2078–2079.
13. Pertea,G. and Pertea,M. (2020) GFF Utilities: GffRead and GffCompare. F1000Research, 9, 304.
14. Patro,R., Duggal,G., Love,M.I., Irizarry,R.A. and Kingsford,C. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14, 417–419.
15. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.
16. Köster,J. and Rahmann,S. (2012) Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28, 2520–2522.
17. R Core Team (2019) R: A Language and Environment for Statistical Computing R Foundation for Statistical Computing, Vienna, Austria.
18. Zerbino,D.R., Achuthan,P., Akanni,W., Amode,M.R., Barrell,D., Bhai,J., Billis,K., Cummins,C., Gall,A. and Girón,C.G. et al. (2018) Ensembl 2018. Nucleic Acids Research, 46, D754–D761.
19. O’Leary,N.A., Wright,M.W., Brister,J.R., Ciufo,S., Haddad,D., McVeigh,R., Rajput,B., Robbertse,B., Smith-White,B. and Ako-Adjei,D. et al. (2016) Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44, D733–745.
20. Ponomarenko,E.A., Poverennaya,E.V., Ilgisonis,E.V., Pyatnitskiy,M.A., Kopylov,A.T., Zgoda,V.G., Lisitsa,A.V. and Archakov,A.I. (2016) The Size of the Human Proteome: The Width and Depth. International Journal of Analytical Chemistry, 2016.
21. Demircioğlu,D., Cukuroglu,E., Kindermans,M., Nandi,T., Calabrese,C., Fonseca,N.A., Kahles,A., Lehmann,K.-V., Stegle,O. and Brazma,A. et al. (2019) A Pan-cancer Transcriptome Analysis Reveals Pervasive Regulation through Alternative Promoters. Cell, 178, 1465–1477.e17.
22. Mitra,M., Johnson,E.L., Swamy,V.S., Nersesian,L.E., Corney,D.C., Robinson,D.G., Taylor,D.G., Ambrus,A.M., Jelinek,D. and Wang,W. et al. (2018) Alternative polyadenylation factors link cell cycle to migration. Genome Biology, 19, 176.
23. Xu,C., Park,J.-K. and Zhang,J. (2019) Evidence that alternative transcriptional initiation is largely nonadaptive. PLOS Biology, 17, e3000197.